QUESTION: For passengers who pay their tip by credit card, what factors influence the amount tipped? Should you tip your taxi driver, and if so, how much?
We analyse the factors that determine the tip amount paid to the driver, and why knowing them is useful.
Hypothesis: factors that may affect the tip amount include: 1. vendor service, 2. driver service, 3. location (some neighbourhoods may tip more generously), 4. time of day, 5. trip distance, 6. passenger count, 7. weather.
The data are a 20,000-trip sample of NYC yellow-taxi trip records from June 2019, together with the TLC taxi zone lookup table.
unprocessed_data = read.csv("../Data/trips_zones_20000_v2.csv")
taxi_zones <- read.csv("../Data/taxi_zone_lookup.csv")
glimpse(unprocessed_data)
## Observations: 20,000
## Variables: 19
## $ X <int> 4524218, 6458048, 3369795, 18532, 1743670,…
## $ DOLocationID <int> 211, 249, 161, 4, 107, 246, 237, 125, 142,…
## $ PULocationID <int> 90, 125, 68, 87, 234, 230, 163, 249, 236, …
## $ VendorID <int> 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
## $ tpep_pickup_datetime <fct> 6/16/19 0:15, 6/28/19 0:09, 6/14/19 23:04,…
## $ tpep_dropoff_datetime <fct> 6/16/19 0:28, 6/28/19 0:16, 6/14/19 23:22,…
## $ passenger_count <int> 1, 1, 2, 1, 1, 6, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ trip_distance <dbl> 1.60, 1.12, 2.72, 2.90, 0.62, 1.90, 0.96, …
## $ RatecodeID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ store_and_fwd_flag <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ payment_type <int> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, …
## $ fare_amount <dbl> 10.0, 6.5, 13.5, 11.0, 5.0, 11.0, 7.0, 7.0…
## $ extra <dbl> 0.5, 0.5, 0.5, 3.5, 0.5, 0.5, 0.0, 3.0, 0.…
## $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount <dbl> 0.00, 2.06, 2.60, 0.00, 1.76, 0.00, 1.00, …
## $ tolls_amount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.…
## $ total_amount <dbl> 13.80, 12.36, 19.90, 15.30, 10.56, 14.80, …
## $ congestion_surcharge <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.…
glimpse(taxi_zones)
## Observations: 265
## Variables: 4
## $ LocationID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Borough <fct> EWR, Queens, Bronx, Manhattan, Staten Island, State…
## $ Zone <fct> Newark Airport, Jamaica Bay, Allerton/Pelham Garden…
## $ service_zone <fct> EWR, Boro Zone, Boro Zone, Yellow Zone, Boro Zone, …
The dataset provides a location ID that corresponds to a taxi zone in one of the five boroughs. These nominal variables do not provide much value in their integer format, since we do not know the geographical location behind each ID. We therefore downloaded a taxi zone lookup dataset that provides the borough for each location ID, as well as the specific neighborhood within it, and merged it into the taxi dataset to identify the borough of both pick up and drop off.
unprocessed_data <- merge(x = unprocessed_data, y = taxi_zones, by.x = "PULocationID", by.y = "LocationID", all.x = TRUE)
unprocessed_data <- subset(unprocessed_data, select = -c(Zone, service_zone))
colnames(unprocessed_data)[colnames(unprocessed_data)=="Borough"] <- "Borough_pu"
unprocessed_data <- merge(x = unprocessed_data, y = taxi_zones, by.x = "DOLocationID", by.y = "LocationID", all.x = TRUE)
colnames(unprocessed_data)[colnames(unprocessed_data)=="Borough"] <- "Borough_do"
unique(unprocessed_data$Borough_do)
## [1] EWR Bronx Manhattan Queens Brooklyn
## [6] Staten Island Unknown
## Levels: Bronx Brooklyn EWR Manhattan Queens Staten Island Unknown
Looking at the distribution of raw tip amount, it is clear that it is not normally distributed.
A normal distribution is an assumption of many statistical analyses. Generally, raw tip amounts vary because fare amounts vary. One factor that may not vary as much is the tipping percentage: in the US there is often a standardized percentage that a customer gives (for example, 15% at restaurants). We divided the tip amount by the fare amount to obtain a tipping percentage:
# create percentage tip
unprocessed_data$per_tip <- unprocessed_data$tip_amount / unprocessed_data$fare_amount
summary(unprocessed_data)
## DOLocationID PULocationID X VendorID
## Min. : 1.0 Min. : 3.0 Min. : 1016 Min. :1.000
## 1st Qu.:107.0 1st Qu.:114.0 1st Qu.:1725764 1st Qu.:1.000
## Median :162.0 Median :161.0 Median :3459818 Median :2.000
## Mean :160.4 Mean :161.9 Mean :3461833 Mean :1.642
## 3rd Qu.:233.0 3rd Qu.:233.0 3rd Qu.:5200940 3rd Qu.:2.000
## Max. :265.0 Max. :265.0 Max. :6940096 Max. :4.000
##
## tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## 6/11/19 7:56 : 7 6/24/19 18:36: 6 Min. :0.000
## 6/14/19 13:28: 6 6/27/19 21:45: 6 1st Qu.:1.000
## 6/3/19 15:08 : 6 6/29/19 0:26 : 6 Median :1.000
## 6/11/19 13:53: 5 6/1/19 22:46 : 5 Mean :1.565
## 6/14/19 23:16: 5 6/10/19 11:17: 5 3rd Qu.:2.000
## 6/20/19 9:26 : 5 6/11/19 19:25: 5 Max. :6.000
## (Other) :19966 (Other) :19967
## trip_distance RatecodeID store_and_fwd_flag payment_type
## Min. : 0.000 Min. :1.000 N:19893 Min. :1.000
## 1st Qu.: 0.990 1st Qu.:1.000 Y: 107 1st Qu.:1.000
## Median : 1.645 Median :1.000 Median :1.000
## Mean : 3.037 Mean :1.054 Mean :1.291
## 3rd Qu.: 3.100 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :51.200 Max. :5.000 Max. :4.000
##
## fare_amount extra mta_tax tip_amount
## Min. :-160.00 Min. :-1.000 Min. :-0.5000 Min. : 0.000
## 1st Qu.: 6.50 1st Qu.: 0.000 1st Qu.: 0.5000 1st Qu.: 0.000
## Median : 9.50 Median : 0.500 Median : 0.5000 Median : 1.960
## Mean : 13.47 Mean : 1.163 Mean : 0.4949 Mean : 2.277
## 3rd Qu.: 15.00 3rd Qu.: 2.500 3rd Qu.: 0.5000 3rd Qu.: 3.000
## Max. : 399.20 Max. : 7.000 Max. : 0.5000 Max. :175.000
##
## tolls_amount improvement_surcharge total_amount
## Min. :-6.1200 Min. :-0.3000 Min. :-160.80
## 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 11.30
## Median : 0.0000 Median : 0.3000 Median : 14.80
## Mean : 0.4059 Mean : 0.2985 Mean : 19.56
## 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 21.20
## Max. :43.4300 Max. : 0.3000 Max. : 400.00
##
## congestion_surcharge Borough_pu Borough_do
## Min. :-2.500 Bronx : 35 Bronx : 159
## 1st Qu.: 2.500 Brooklyn : 242 Brooklyn : 810
## Median : 2.500 EWR : 0 EWR : 40
## Mean : 2.273 Manhattan :18088 Manhattan :17661
## 3rd Qu.: 2.500 Queens : 1466 Queens : 1073
## Max. : 2.750 Staten Island: 1 Staten Island: 3
## Unknown : 168 Unknown : 254
## Zone service_zone per_tip
## Midtown Center : 791 Airports : 459 Min. : 0.0000
## Upper East Side North : 765 Boro Zone : 2639 1st Qu.: 0.0000
## Upper East Side South : 758 EWR : 40 Median : 0.2267
## Murray Hill : 619 N/A : 254 Mean : 0.1839
## Times Sq/Theatre District: 614 Yellow Zone:16608 3rd Qu.: 0.2878
## (Other) :16380 Max. :11.4400
## NA's : 73 NA's :15
The dataset comes with an associated data definition guide.
# keep trips with a positive fare paid by credit card (payment_type == 1) and a plausible passenger count
processed_df <- unprocessed_data %>% filter((fare_amount > 0) & (payment_type == 1) & (passenger_count < 7))
# removing outliers: per_tip values lying outside the boxplot whiskers (1.5 * IQR)
while (length(boxplot.stats(processed_df$per_tip)$out) != 0){
  per_tip_outliers <- boxplot.stats(processed_df$per_tip)$out
  processed_df <- processed_df[!processed_df$per_tip %in% per_tip_outliers, ]
}
## 'data.frame': 12965 obs. of 24 variables:
## $ DOLocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 233 231 186 234 231 161 246 68 50 132 ...
## $ X : int 14365 1376 8878 13985 9970 7812 3427 13195 9857 7422 ...
## $ VendorID : int 2 1 1 1 2 1 2 2 2 1 ...
## $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 13958 4851 10536 8018 11326 1673 8964 12138 12514 12942 ...
## $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 13903 4833 10443 7946 11238 1655 8895 12052 12466 12912 ...
## $ passenger_count : int 1 1 1 1 1 1 4 1 1 1 ...
## $ trip_distance : num 23.9 15.5 16.7 14.2 13.5 ...
## $ RatecodeID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 96 73.5 71 64 54.5 68 61 61.5 70.5 117 ...
## $ extra : num 0 1 0 0 0 0 1 0.5 1 0 ...
## $ mta_tax : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_amount : num 29.2 18.4 12 19.1 13.1 ...
## $ tolls_amount : num 20.5 17.5 10.5 12.5 10.5 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 146 110.8 93.8 96 78.4 ...
## $ congestion_surcharge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Borough_pu : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 5 ...
## $ Borough_do : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Zone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
## $ service_zone : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ per_tip : num 0.304 0.251 0.169 0.299 0.24 ...
Next, we derive time-of-day features from the pickup and drop-off datetime columns, along with the trip duration in minutes:
## 'data.frame': 12220 obs. of 36 variables:
## $ DOLocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 231 161 68 143 68 125 164 87 230 100 ...
## $ X : int 9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
## $ VendorID : int 2 1 2 2 1 1 2 2 2 2 ...
## $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
## $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
## $ passenger_count : int 1 1 1 1 1 1 1 1 5 2 ...
## $ trip_distance : num 13.5 17.3 16.4 17.8 15.3 ...
## $ RatecodeID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
## $ extra : num 0 0 0.5 0 0 0 0 0.5 0.5 0 ...
## $ mta_tax : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_amount : num 13.1 15.8 10 16.7 8 ...
## $ tolls_amount : num 10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 78.4 94.5 84.8 100 78.8 ...
## $ congestion_surcharge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Borough_pu : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Borough_do : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Zone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
## $ service_zone : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ per_tip : num 0.24 0.232 0.163 0.254 0.133 ...
## $ pickup_datetime : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
## $ pickup_time : chr "07:40" "12:12" "20:09" "07:11" ...
## $ pickup_hrs : chr "07" "12" "20" "07" ...
## $ pickup_hrs_num : num 7 12 20 7 6 16 12 4 3 6 ...
## $ pickup_time_type : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ dropoff_datetime : POSIXlt, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
## $ dropoff_time : chr "08:01" "12:47" "20:36" "07:45" ...
## $ dropoff_hrs : chr "08" "12" "20" "07" ...
## $ dropoff_hrs_num : num 8 12 20 7 7 16 12 4 3 7 ...
## $ dropoff_time_type : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ trip_duration_mins : 'difftime' num 21 35 27 34 ...
## ..- attr(*, "units")= chr "mins"
## $ trip_duration_mins1 : num 21 35 27 34 26 34 31 31 24 31 ...
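The derived columns above were created without the code being shown; a minimal base-R sketch of the assumed approach follows (column names match the str() output; the time-of-day cut points are an assumption, not the report's code). Note that parsing dates like "6/29/19 7:40" needs the two-digit-year format %y; the "0019-…" years above suggest %Y was applied to a two-digit year.

```r
# parse pickup/dropoff strings; "%y" maps "19" to 2019 (assumed format)
pickup  <- as.POSIXct("6/29/19 7:40", format = "%m/%d/%y %H:%M", tz = "UTC")
dropoff <- as.POSIXct("6/29/19 8:01", format = "%m/%d/%y %H:%M", tz = "UTC")

# trip duration in minutes
trip_duration_mins <- as.numeric(difftime(dropoff, pickup, units = "mins"))

# hour of day and a coarse time-of-day bucket (cut points are an assumption)
pickup_hrs_num <- as.numeric(format(pickup, "%H"))
pickup_time_type <- cut(pickup_hrs_num,
                        breaks = c(-1, 11, 16, 21, 24),
                        labels = c("Morning", "Afternoon", "Evening", "Night"),
                        ordered_result = TRUE)
```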
processed_df %>%
ggplot(aes(per_tip)) +
geom_histogram(aes(y =..density..),
colour = "black",
                 fill = "light blue") + stat_function(fun = dnorm, args = list(mean = mean(processed_df$per_tip), sd = sd(processed_df$per_tip)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
processed_df$passenger_count <- as.factor(processed_df$passenger_count)
processed_df %>%
group_by(passenger_count) %>%
count() %>%
ggplot(aes(passenger_count, n, fill = passenger_count)) +
geom_col() +
scale_y_sqrt() +
  theme(legend.position = "none")
processed_df %>%
ggplot(aes(VendorID, fill = VendorID)) +
geom_bar() +
  theme(legend.position = "none")
EDA questions: 1. Which places give the maximum and minimum tip? 2. How does the tip amount vary? 3. What time of day yields a higher tip? 4. How is tipping distributed across passenger counts? 5. How do distance and duration vary across trips?
# Bar chart according to VendorID
processed_df %>% group_by(VendorID) %>% summarise(n = n(), average = mean(per_tip), min = min(per_tip), max = max(per_tip))
## # A tibble: 3 x 5
## VendorID n average min max
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 4496 0.261 0.0952 0.435
## 2 2 7678 0.264 0.0952 0.434
## 3 4 46 0.269 0.108 0.416
ggplot(processed_df, aes(y=per_tip, group = VendorID)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,
               outlier.size=4)
hist(processed_df$per_tip,probability=T, main="Histogram of normal
data",xlab="Approximately normally distributed data")
lines(density(processed_df$per_tip), col = 2)
## [1] 0.07113667
## [1] 0.2630256
# TANAYA
# anova tests
anova_tip_amount = aov(per_tip ~ passenger_count, data = processed_df)
summary(anova_tip_amount)
## Df Sum Sq Mean Sq F value Pr(>F)
## passenger_count 6 0.06 0.010469 2.07 0.0533 .
## Residuals 12213 61.77 0.005058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anderson-Darling normality test
##
## data: processed_df$per_tip
## A = 51.601, p-value < 2.2e-16
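The Anderson–Darling statistic above comes from a test that handles large samples; since Shapiro–Wilk is capped at 5,000 observations, a base-R workaround (a sketch on toy data, not the report's code) is to run Shapiro–Wilk on a random subsample:

```r
set.seed(42)
x <- rexp(12000)          # toy right-skewed data, n > 5000 like per_tip
sub <- sample(x, 4000)    # subsample within Shapiro-Wilk's size limit
res <- shapiro.test(sub)
res$p.value               # far below 0.05: normality rejected
```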
## Df Sum Sq Mean Sq F value Pr(>F)
## pickup_time_type 3 0.20 0.06797 13.47 8.97e-09 ***
## Residuals 12216 61.63 0.00504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = per_tip ~ pickup_time_type, data = processed_df)
##
## $pickup_time_type
## diff lwr upr p adj
## Afternoon-Morning 0.002581970 -0.0020717179 0.0072356582 0.4833023
## Evening-Morning 0.010398583 0.0059664376 0.0148307280 0.0000000
## Night-Morning 0.005867182 0.0009774036 0.0107569603 0.0110525
## Evening-Afternoon 0.007816613 0.0032848786 0.0123483466 0.0000557
## Night-Afternoon 0.003285212 -0.0016950125 0.0082654361 0.3263547
## Night-Evening -0.004531401 -0.0093052601 0.0002424584 0.0700097
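The ANOVA table and Tukey comparisons above come from calls of the following shape (a sketch on toy data; per_tip and pickup_time_type are the report's column names):

```r
set.seed(7)
toy <- data.frame(
  per_tip = c(rnorm(60, mean = 0.25, sd = 0.05),
              rnorm(60, mean = 0.28, sd = 0.05)),
  pickup_time_type = factor(rep(c("Morning", "Evening"), each = 60))
)
fit <- aov(per_tip ~ pickup_time_type, data = toy)
summary(fit)     # one-way ANOVA F test
TukeyHSD(fit)    # pairwise differences with family-wise adjusted p-values
```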
Since p < 0.05, we reject the null hypothesis that the mean tip percentage is equal across all times of day: not all of the means are equal, and the result is statistically significant.
## Df Sum Sq Mean Sq F value Pr(>F)
## dropoff_time_type 3 0.22 0.07189 14.25 2.87e-09 ***
## Residuals 12216 61.62 0.00504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = per_tip ~ dropoff_time_type, data = processed_df)
##
## $dropoff_time_type
## diff lwr upr p adj
## Afternoon-Morning 0.002019670 -0.0026711257 0.006710466 0.6856715
## Evening-Morning 0.010641669 0.0061625724 0.015120765 0.0000000
## Night-Morning 0.004394981 -0.0004606098 0.009250572 0.0922912
## Evening-Afternoon 0.008621999 0.0040836819 0.013160315 0.0000064
## Night-Afternoon 0.002375311 -0.0025349617 0.007285584 0.5994586
## Night-Evening -0.006246688 -0.0109551391 -0.001538236 0.0036582
Since p < 0.05, we again reject the null hypothesis of equal mean tip percentages across drop-off times of day: not all of the means are equal, and the result is statistically significant.
##
## Welch Two Sample t-test
##
## data: processed_df$per_tip and processed_df$trip_duration_mins1
## t = -179.56, df = 12221, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.88660 -12.60829
## sample estimates:
## mean of x mean of y
## 0.2630256 13.0104746
Since p < 0.05, we reject the null hypothesis that per_tip and trip_duration_mins1 have equal means; the true difference in means is not zero, and the result is statistically significant.
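The Welch output above has the shape of a plain t.test(x, y) call (the unequal-variance Welch version is R's default); a sketch on toy vectors:

```r
set.seed(3)
x <- rnorm(100, mean = 0.26, sd = 0.07)  # stand-in for per_tip
y <- rnorm(100, mean = 13.0, sd = 6.0)   # stand-in for trip_duration_mins1
res <- t.test(x, y)                      # Welch two-sample t-test by default
res$method
res$conf.int                             # 95% CI for the difference in means
```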
#Madhuri Ends ####################################
Next we choose the variables, pick a suitable significance test, and check its assumptions.
# AMNA
# Two variables: dependent variable tip_amount, independent variable
# trip_distance. Subset them into a new data frame.
df_dist <- subset(processed_df, select=c(tip_amount, trip_distance))
glimpse(df_dist)
## Observations: 12,220
## Variables: 2
## $ tip_amount <dbl> 13.06, 15.75, 10.00, 16.66, 8.00, 12.00, 17.20, 14…
## $ trip_distance <dbl> 13.46, 17.30, 16.41, 17.84, 15.30, 13.30, 14.27, 1…
# Choosing a test: a z-test is out because the population mean and standard
# deviation are unknown, and a one-sample t-test is out because there is no
# pre-determined population mean (or other theoretically derived value) to
# compare the sample mean against. We therefore use an independent two-sample
# t-test, which estimates whether a difference in means between two groups is
# the result of random variation or of real characteristics of the groups.
# To form the groups, split trip_distance into two categories at its mean.
factor_dist <- cut(df_dist$trip_distance,
                   breaks = c(0, mean(df_dist$trip_distance), Inf),
                   labels = c("Shorter Distances", "Longer Distances"))
df_dist$trip_distance <- factor_dist # assign the factored column back to the data frame
glimpse(df_dist)
## Observations: 12,220
## Variables: 2
## $ tip_amount <dbl> 13.06, 15.75, 10.00, 16.66, 8.00, 12.00, 17.20, 14…
## $ trip_distance <fct> Longer Distances, Longer Distances, Longer Distanc…
# Lets have a look at them
# Basic histogram of tip variable
theme_update(plot.title = element_text(hjust = 0.5)) # center all chart titles from here on (default is left-aligned)
ggplot(df_dist, aes(x=tip_amount)) +
geom_histogram(binwidth=0.1, color="green") + coord_cartesian(xlim=c(0,20)) +
  labs(x="Tips($)", y = "Count") + ggtitle("Histogram for Tip values")
# Some of the values look like outliers that skew the whole distribution, so we
# may treat them as outliers and remove them from the dataset.
# Plot combining values from both of my variables
# Use a semi-transparent (alpha = 0.5) fill; the legend title can be changed with labs(fill = ..., color = ...)
p<-ggplot(df_dist, aes(x=tip_amount, fill=trip_distance, color=trip_distance)) +
geom_histogram(binwidth=0.1, position="identity", alpha=0.5) + coord_cartesian(xlim=c(0,20)) +
labs(x="Tips($)", y = "Count") + ggtitle("Histogram of Tip values by Distance travelled")
p
# Judging by the shape of this plot alone, it is hard to say whether there is a
# relationship between these two variables.
# Before applying the two-sample t-test, its conditions must be met for
# reliable results: random sampling, approximate normality, and independence.
# The first step in significance testing is to state the hypotheses.
# Null: the average tip from passengers travelling shorter distances equals
# that from passengers travelling longer distances.
str(processed_df)
## 'data.frame': 12220 obs. of 36 variables:
## $ DOLocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PULocationID : int 231 161 68 143 68 125 164 87 230 100 ...
## $ X : int 9970 7812 13195 12280 2391 15423 1082 1862 9274 11863 ...
## $ VendorID : int 2 1 2 2 1 1 2 2 2 2 ...
## $ tpep_pickup_datetime : Factor w/ 15407 levels "5/31/19 23:58",..: 11326 1673 12138 9232 7771 14662 6282 3135 6642 14376 ...
## $ tpep_dropoff_datetime: Factor w/ 15299 levels "6/1/19 0:04",..: 11238 1655 12052 9152 7686 14570 6243 3092 6579 14275 ...
## $ passenger_count : Factor w/ 7 levels "0","1","2","3",..: 2 2 2 2 2 2 2 2 6 3 ...
## $ trip_distance : num 13.5 17.3 16.4 17.8 15.3 ...
## $ RatecodeID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fare_amount : num 54.5 68 61.5 65.5 60 57.5 58 69.5 64 65.5 ...
## $ extra : num 0 0 0.5 0 0 0 0 0.5 0.5 0 ...
## $ mta_tax : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tip_amount : num 13.1 15.8 10 16.7 8 ...
## $ tolls_amount : num 10.5 10.5 12.5 17.5 10.5 23 10.5 23 10.5 17.5 ...
## $ improvement_surcharge: num 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ total_amount : num 78.4 94.5 84.8 100 78.8 ...
## $ congestion_surcharge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Borough_pu : Factor w/ 7 levels "Bronx","Brooklyn",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Borough_do : Factor w/ 7 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Zone : Factor w/ 261 levels "Allerton/Pelham Gardens",..: 169 169 169 169 169 169 169 169 169 169 ...
## $ service_zone : Factor w/ 5 levels "Airports","Boro Zone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ per_tip : num 0.24 0.232 0.163 0.254 0.133 ...
## $ pickup_datetime : POSIXct, format: "0019-06-29 07:40:00" "0019-06-12 12:12:00" ...
## $ pickup_time : chr "07:40" "12:12" "20:09" "07:11" ...
## $ pickup_hrs : chr "07" "12" "20" "07" ...
## $ pickup_hrs_num : num 7 12 20 7 6 16 12 4 3 6 ...
## $ pickup_time_type : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ dropoff_datetime : POSIXlt, format: "0019-06-29 08:01:00" "0019-06-12 12:47:00" ...
## $ dropoff_time : chr "08:01" "12:47" "20:36" "07:45" ...
## $ dropoff_hrs : chr "08" "12" "20" "07" ...
## $ dropoff_hrs_num : num 8 12 20 7 7 16 12 4 3 7 ...
## $ dropoff_time_type : Ord.factor w/ 4 levels "Morning"<"Afternoon"<..: 1 2 3 1 1 2 2 4 4 1 ...
## $ trip_duration_mins : 'difftime' num 21 35 27 34 ...
## ..- attr(*, "units")= chr "mins"
## $ trip_duration_mins1 : num 21 35 27 34 26 34 31 31 24 31 ...
# the sample is large (approx. 12,000 observations), so by the CLT the sampling distribution of the mean is approximately normal
# shapiro.test(processed_df$tip_amount)
# For Shapiro, sample size must be between 3 and 5000
summary(processed_df$per_tip)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09524 0.22615 0.26609 0.26303 0.30857 0.43478
summary(processed_df$trip_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.010 1.600 2.456 2.700 25.100
# the raw scatter plot is not very informative
plot(processed_df$trip_distance, processed_df$per_tip, main = "Scatter plot of Percentage Tip Amount vs Distance travelled")
# a log scale on the tip axis might help, e.g.:
# plot(processed_df$trip_distance, processed_df$tip_amount, log = "y",
#      main = "Scatter plot of Tip Amount vs Distance travelled (log scale)")
hist(processed_df$tip_amount, main="Tip Amount Histogram", xlab = "Tip ($)", col = "aquamarine") # not normal
hist(processed_df$per_tip, main="Tip Percentage Histogram", xlab = "Tip Percentage", col = "aquamarine") # pretty normal
hist(processed_df$trip_distance, main="Trip Distance (miles) Histogram", xlab = "Trip Distance (miles)", col = "aquamarine")# not normal
qqnorm(processed_df$tip_amount, main="Normal Q-Q Plot for Tip Amount", xlab="Theoretical Quantiles (z-scores)", ylab="Tip values", col="darkviolet")
qqline(processed_df$tip_amount) # end points digress
qqnorm(processed_df$per_tip, main="Normal Q-Q Plot for Tip Percentages", xlab="Theoretical Quantiles (z-scores)", ylab="Tip Percentage values", col="darkviolet")
qqline(processed_df$per_tip) # much better results than simple tip amount
qqnorm(processed_df$trip_distance, main="Normal Q-Q Plot for Trip Distances", xlab="Theoretical Quantiles (z-scores)", ylab="Trip Distances values", col="darkviolet")
qqline(processed_df$trip_distance) # same result as tip_amount, end points digress
# cannot perform Shapiro because observations (approx 13,000) > 5000
mean_dist <- mean(processed_df$trip_distance)
# since trip_distance is not normal, the median could be used instead of the mean as the split point
str(processed_df) # structure unchanged; identical to the str() output shown above
# subset on the basis of the factored trip-distance column (created in df_dist)
short_dist <- subset(df_dist, trip_distance == "Shorter Distances")
long_dist <- subset(df_dist, trip_distance == "Longer Distances")
# Applying the two-sample t-test
result <- t.test(short_dist$tip_amount, long_dist$tip_amount)
result
# My Null Hyp = Both means equal, Alt Hyp = Both means not equal
plot(processed_df$per_tip ~ factor_dist,
     main = "Tips paid by Shorter vs Longer Distance Commuters",
     ylab = "Tip Percentage", xlab = "Distance travelled",
     col = c("#ff0000", "#00ff00"))
M = tapply(processed_df$per_tip,
           INDEX = factor_dist,
           FUN = mean)
points(M,
col="yellow",
pch="+", ### symbol to use, see ?points
       cex=2)
As previously discussed, location can also affect the amount of tipping, because each location has its own customs, traditions, and socioeconomic variables that shape tipping [1]. In this context, a person’s wealth may also affect how much he or she tips the taxi driver. While most prior analysis was done on a larger, regional scale, given the distinct culture of each borough of New York, there may be microcosms of tipping culture within the city that differ between boroughs. The dataset does not provide the passenger’s home location, wealth, or purpose of travel; it simply provides pick up and drop off locations. However, these locations may provide some context on a passenger’s background for two main reasons. The first is simple: passengers may use taxis to travel between home and work [2]. Secondly, each individual has their own time-space geography, and some geographers have argued that people’s time-spaces are often segregated, whether by gender or race, which can affect their access to resources. Pick up and drop off locations may therefore hint at how an individual lives.
[1] https://scholarship.sha.cornell.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1102&context=articles [2] https://ieeexplore.ieee.org/abstract/document/4624004; https://onlinelibrary.wiley.com/doi/pdf/10.1111/0033-0124.00158?casa_token=fgUlwmzlofwAAAAA:k19TGhhaKN9uLpNH1QXQgXrqRrm77fEgGmBisvuspaVDzvnNfXkrcvBvC4QsJMOmptyw8ew-qJWCd4s; https://biblio.ugent.be/publication/3029997/file/6779790.pdf; https://www.tandfonline.com/doi/full/10.1080/02723638.2016.1142152
To better understand this issue, we ran descriptive statistics. First, we calculated the number of trips in each borough, grouped first by pick up location and then by drop off location.
These bar charts show that Manhattan has the highest number of both pick up and drop offs, followed by Queens and Brooklyn in second and third, respectively. We also looked at the frequency of various drop off and pick up combinations.
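The counts behind these bar charts and the pick-up/drop-off combination table reduce to frequency tables; a base-R sketch on a toy data frame (Borough_pu and Borough_do are the report's column names):

```r
toy <- data.frame(
  Borough_pu = c("Manhattan", "Manhattan", "Queens", "Brooklyn"),
  Borough_do = c("Manhattan", "Queens", "Manhattan", "Manhattan")
)
pu_counts <- sort(table(toy$Borough_pu), decreasing = TRUE)  # pick-ups per borough
od_counts <- table(toy$Borough_pu, toy$Borough_do)           # origin-destination combinations
pu_counts
od_counts
```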
This table shows that Manhattan to Manhattan has the highest number of trips, followed by Queens to Manhattan, Manhattan to Queens, and Manhattan to Brooklyn.
Considering that Manhattan and Queens have the highest numbers of pick ups and drop offs, it makes sense that trips within and between these boroughs are also the most frequent. Manhattan scoring highest on both measures is also reasonable, because yellow taxis (the focus of this study) mainly serve Manhattan, whereas green taxis usually serve the other boroughs, which have traditionally been underserved by taxis (SOURCE).
With these descriptive statistics in mind, we decided to compare the mean tipping percentage for each borough based on drop off and pick up location to see if they were statistically different.
We used the ANOVA test. Here are the results:
## Call:
## aov(formula = per_tip ~ Borough_pu, data = processed_df)
##
## Terms:
## Borough_pu Residuals
## Sum of Squares 0.50296 61.33039
## Deg. of Freedom 4 12215
##
## Residual standard error: 0.07085836
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_pu 4 0.50 0.12574 25.04 <2e-16 ***
## Residuals 12215 61.33 0.00502
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## aov(formula = per_tip ~ Borough_do, data = processed_df)
##
## Terms:
## Borough_do Residuals
## Sum of Squares 0.74527 61.08808
## Deg. of Freedom 6 12213
##
## Residual standard error: 0.07072404
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_do 6 0.75 0.1242 24.83 <2e-16 ***
## Residuals 12213 61.09 0.0050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values are 1.0991881 × 10^-20 and 1.9320892 × 10^-29 for pick-up and drop-off, respectively. Both are far below a significance level of 0.05 (a 0.95 confidence level). Thus, we can reject the null hypothesis that the means are the same: mean tipping percentage differs statistically across boroughs at the 0.05 level.
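The exact p-values can be recovered from the F statistics, since the printed ANOVA table truncates them at <2e-16. A sketch using the pick-up model's reported F value and degrees of freedom:

```r
# The upper-tail probability of the F distribution gives the ANOVA p-value;
# F = 25.04 on (4, 12215) degrees of freedom is the pick-up model above
p_pu <- pf(25.04, df1 = 4, df2 = 12215, lower.tail = FALSE)
# p_pu is on the order of 1e-20, far below the 0.05 threshold
```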
Because the overall tests are significant, the next step is to conduct Tukey's HSD test, which compares each pair of groups to see whether they differ significantly. Here are the results from that analysis:
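With the real data this is `TukeyHSD(aov(per_tip ~ Borough_pu, data = processed_df))`; the sketch below shows the mechanics on synthetic two-group data so it runs standalone.

```r
# Synthetic two-group tipping data (means 0.20 vs 0.15, tight spread),
# standing in for the real per-borough groups
set.seed(1)
toy <- data.frame(tip = c(rnorm(30, mean = 0.20, sd = 0.01),
                          rnorm(30, mean = 0.15, sd = 0.01)),
                  grp = factor(rep(c("A", "B"), each = 30)))

# Tukey's HSD on the fitted one-way ANOVA
tk <- TukeyHSD(aov(tip ~ grp, data = toy))
# tk$grp holds one row per group pair: diff, lwr, upr, adjusted p-value
```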
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = per_tip ~ Borough_pu, data = processed_df)
##
## $Borough_pu
## diff lwr upr p adj
## Brooklyn-Bronx -0.147655747 -0.341855003 0.04654351 0.2312841
## Manhattan-Bronx -0.112689738 -0.306012622 0.08063315 0.5036324
## Queens-Bronx -0.141636653 -0.335165845 0.05189254 0.2675855
## Unknown-Bronx -0.119411173 -0.313626929 0.07480458 0.4480395
## Manhattan-Brooklyn 0.034966008 0.016362693 0.05356932 0.0000030
## Queens-Brooklyn 0.006019094 -0.014618111 0.02665630 0.9319400
## Unknown-Brooklyn 0.028244573 0.001936672 0.05455247 0.0281303
## Queens-Manhattan -0.028946915 -0.038235632 -0.01965820 0.0000000
## Unknown-Manhattan -0.006721435 -0.025496198 0.01205333 0.8658243
## Unknown-Queens 0.022225480 0.001433592 0.04301737 0.0292158
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = per_tip ~ Borough_do, data = processed_df)
##
## $Borough_do
## diff lwr upr p adj
## Brooklyn-Bronx 0.007440429 -0.029833482 0.044714340 0.9971509
## EWR-Bronx 0.005236027 -0.064790592 0.075262646 0.9999905
## Manhattan-Bronx 0.041394710 0.005574339 0.077215081 0.0117042
## Queens-Bronx 0.017922932 -0.019339844 0.055185707 0.7921154
## Staten Island-Bronx 0.055393986 -0.156202712 0.266990684 0.9875895
## Unknown-Bronx 0.020288045 -0.019916376 0.060492466 0.7521602
## EWR-Brooklyn -0.002204402 -0.063315819 0.058907016 0.9999999
## Manhattan-Brooklyn 0.033954281 0.023278280 0.044630282 0.0000000
## Queens-Brooklyn 0.010482503 -0.004329399 0.025294405 0.3604522
## Staten Island-Brooklyn 0.047953557 -0.160862248 0.256769362 0.9938422
## Unknown-Brooklyn 0.012847617 -0.008301224 0.033996457 0.5539288
## Manhattan-EWR 0.036158683 -0.024077186 0.096394552 0.5684189
## Queens-EWR 0.012686905 -0.048417721 0.073791531 0.9964559
## Staten Island-EWR 0.050157959 -0.166909826 0.267225744 0.9936320
## Unknown-EWR 0.015052018 -0.047889672 0.077993708 0.9923343
## Queens-Manhattan -0.023471778 -0.034108837 -0.012834720 0.0000000
## Staten Island-Manhattan 0.013999276 -0.194561974 0.222560526 0.9999950
## Unknown-Manhattan -0.021106665 -0.039573609 -0.002639721 0.0133007
## Staten Island-Queens 0.037471054 -0.171342764 0.246284872 0.9984320
## Unknown-Queens 0.002365113 -0.018764095 0.023494322 0.9998970
## Unknown-Staten Island -0.035105941 -0.244464703 0.174252822 0.9989324
While the means overall are not equal, the following pick-up pairs show significant differences in tipping percentage (excluding Unknown), given their small p-values: Manhattan and Brooklyn, and Queens and Manhattan. For drop-off pairs, Manhattan and Bronx, Manhattan and Brooklyn, and Manhattan and Queens are significant.
ANOVA assumes normally distributed data, and as previously highlighted, the tipping amount is not necessarily normally distributed. Nonetheless, a look at the raw tip amount can add context. We compared the mean tip amount for each borough to see whether those means are statistically different. Here are the results for raw tip amount by location:
## Call:
## aov(formula = tip_amount ~ Borough_pu, data = processed_df)
##
## Terms:
## Borough_pu Residuals
## Sum of Squares 9594.53 37995.94
## Deg. of Freedom 4 12215
##
## Residual standard error: 1.763689
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_pu 4 9595 2398.6 771.1 <2e-16 ***
## Residuals 12215 37996 3.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## aov(formula = tip_amount ~ Borough_do, data = processed_df)
##
## Terms:
## Borough_do Residuals
## Sum of Squares 8972.01 38618.47
## Deg. of Freedom 6 12213
##
## Residual standard error: 1.778223
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Borough_do 6 8972 1495.3 472.9 <2e-16 ***
## Residuals 12213 38618 3.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For these ANOVA analyses, the null hypothesis is that the means are the same across locations. The reported p-values are effectively 0 (below machine precision) for both pick-up and drop-off, well under a significance level of 0.05. Thus, we can reject the null hypothesis: mean tip amounts differ statistically across boroughs at the 0.05 level.
As before, because both tests are significant, we can run Tukey's HSD test:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_amount ~ Borough_pu, data = processed_df)
##
## $Borough_pu
## diff lwr upr p adj
## Brooklyn-Bronx 1.7965138 -3.0371711 6.63019867 0.8490671
## Manhattan-Bronx 1.4192781 -3.3925936 6.23114980 0.9292692
## Queens-Bronx 6.0870444 1.2700377 10.90405122 0.0051451
## Unknown-Bronx 2.7872897 -2.0468058 7.62138529 0.5148065
## Manhattan-Brooklyn -0.3772357 -0.8402784 0.08580714 0.1713150
## Queens-Brooklyn 4.2905307 3.7768637 4.80419765 0.0000000
## Unknown-Brooklyn 0.9907760 0.3359634 1.64558849 0.0003552
## Queens-Manhattan 4.6677663 4.4365670 4.89896563 0.0000000
## Unknown-Manhattan 1.3680116 0.9007014 1.83532179 0.0000000
## Unknown-Queens -3.2997547 -3.8172718 -2.78223764 0.0000000
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = tip_amount ~ Borough_do, data = processed_df)
##
## $Borough_do
## diff lwr upr p adj
## Brooklyn-Bronx -0.9486873 -1.8858699 -0.01150461 0.0450159
## EWR-Bronx 8.6900490 6.9293609 10.45073715 0.0000000
## Manhattan-Bronx -3.0039359 -3.9045720 -2.10329976 0.0000000
## Queens-Bronx 0.4975155 -0.4393872 1.43441820 0.7042679
## Staten Island-Bronx 5.6108824 0.2906798 10.93108488 0.0309347
## Unknown-Bronx -0.1438463 -1.1547112 0.86701851 0.9995836
## EWR-Brooklyn 9.6387363 8.1022042 11.17526838 0.0000000
## Manhattan-Brooklyn -2.0552486 -2.3236767 -1.78682058 0.0000000
## Queens-Brooklyn 1.4462028 1.0737853 1.81862031 0.0000000
## Staten Island-Brooklyn 6.5595696 1.3092874 11.80985181 0.0043210
## Unknown-Brooklyn 0.8048409 0.2730930 1.33658891 0.0001652
## Manhattan-EWR -11.6939849 -13.2085030 -10.17946684 0.0000000
## Queens-EWR -8.1925335 -9.7288948 -6.65617216 0.0000000
## Staten Island-EWR -3.0791667 -8.5369294 2.37859610 0.6404759
## Unknown-EWR -8.8338953 -10.4164462 -7.25134448 0.0000000
## Queens-Manhattan 3.5014514 3.2340025 3.76890031 0.0000000
## Staten Island-Manhattan 8.6148182 3.3709364 13.85870012 0.0000265
## Unknown-Manhattan 2.8600896 2.3957728 3.32440627 0.0000000
## Staten Island-Queens 5.1133668 -0.1368654 10.36359906 0.0621414
## Unknown-Queens -0.6413618 -1.1726162 -0.11010748 0.0068348
## Unknown-Staten Island -5.7547287 -11.0186625 -0.49079484 0.0216086
For pick-up locations, there is no statistical difference between the following pairs (excluding Unknown), given their large p-values: Brooklyn and Bronx, and Manhattan and Bronx. For drop-off locations, there is no difference for: Queens and Bronx, Staten Island and Bronx, Staten Island and Brooklyn, Staten Island and EWR, and Staten Island and Queens. All pairs involving Manhattan as a drop-off location are significant.
Based on this analysis, being dropped off in Manhattan is significantly different from being dropped off elsewhere, and the same largely holds for being picked up in Manhattan. This is consistent with the idea that each borough has its own microcosm of tipping culture, with Manhattan standing apart most clearly.
# Fit a one-predictor linear model: tipping percentage on pick-up borough
lm_pu <- lm(per_tip ~ Borough_pu, data = processed_df)
# Print the model summary
summary(lm_pu)
##
## Call:
## lm(formula = per_tip ~ Borough_pu, data = processed_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.169215 -0.035061 0.003729 0.044118 0.196045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37714 0.07086 5.322 1.04e-07 ***
## Borough_puBrooklyn -0.14766 0.07118 -2.074 0.0381 *
## Borough_puManhattan -0.11269 0.07086 -1.590 0.1118
## Borough_puQueens -0.14164 0.07094 -1.997 0.0459 *
## Borough_puUnknown -0.11941 0.07119 -1.677 0.0935 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07086 on 12215 degrees of freedom
## Multiple R-squared: 0.008134, Adjusted R-squared: 0.007809
## F-statistic: 25.04 on 4 and 12215 DF, p-value: < 2.2e-16
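A natural next step, in line with the factors hypothesized at the top (trip distance, passenger count), would be a model with several predictors. The sketch below is hypothetical and uses synthetic data with the column names from the `glimpse()` output; with the real data the same formula would run against `processed_df`.

```r
# Synthetic stand-in for processed_df, using the real column names
set.seed(3)
toy <- data.frame(
  per_tip         = runif(50, 0, 0.3),
  Borough_pu      = factor(sample(c("Manhattan", "Queens"), 50, replace = TRUE)),
  trip_distance   = runif(50, 0.5, 10),
  passenger_count = sample(1:6, 50, replace = TRUE))

# Tipping percentage modeled on pick-up borough, distance, and passenger count
lm_multi <- lm(per_tip ~ Borough_pu + trip_distance + passenger_count, data = toy)
```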
One limitation was hardware, which restricted how much of the data we could process. Another is that cash tips are by nature undocumented, so this analysis can only speak to tips paid by credit card. Finally, as previously discussed, our data was not perfectly normally distributed; while roughly normal, this may slightly violate some assumptions of the statistical analyses.
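One way to hedge against the normality caveat is a rank-based alternative: the Kruskal-Wallis test makes no normality assumption. With the real data this would be `kruskal.test(per_tip ~ Borough_pu, data = processed_df)`; the sketch below uses deliberately skewed synthetic data so it runs standalone.

```r
# Skewed (exponential) synthetic tips in two groups, where ANOVA's normality
# assumption clearly fails but the rank-based test still applies
set.seed(2)
toy <- data.frame(tip = c(rexp(40, rate = 5), rexp(40, rate = 10)),
                  grp = factor(rep(c("A", "B"), each = 40)))
kw <- kruskal.test(tip ~ grp, data = toy)
```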
Future work might restrict the analysis to the neighborhoods within Manhattan, since yellow taxis are mainly found there, or combine the yellow-taxi data with green-taxi data for an analysis that encompasses all five boroughs.
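Combining the two fleets could look like the sketch below. The toy `yellow` and `green` frames are hypothetical stand-ins for the two cleaned datasets; dplyr's `bind_rows()` with its `.id` argument stacks them while recording which fleet each row came from.

```r
library(dplyr)

# Hypothetical minimal yellow- and green-taxi records sharing column names
yellow <- data.frame(trip_distance = c(1.6, 2.7), tip_amount = c(2.0, 3.5))
green  <- data.frame(trip_distance = 4.1,         tip_amount = 1.0)

# Stack the fleets; .id adds a column recording each row's source
combined <- bind_rows(yellow = yellow, green = green, .id = "fleet")
```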